43 research outputs found

    An Asymptotically Optimal Policy for Uniform Bandits of Unknown Support

    Consider the problem of a controller sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $i = 1,\ldots,N$, and $k = 1, 2, \ldots$; where $X^i_k$ denotes the outcome from population $i$ the $k^{th}$ time it is sampled. It is assumed that for each fixed $i$, $\{ X^i_k \}_{k \geq 1}$ is a sequence of i.i.d. uniform random variables over some interval $[a_i, b_i]$, with the support (i.e., $a_i, b_i$) unknown to the controller. The objective is to have a policy $\pi$ for deciding, based on available data, from which of the $N$ populations to sample at any time $n = 1, 2, \ldots$ so as to maximize the expected sum of outcomes of $n$ samples or, equivalently, to minimize the regret due to lack of information about the parameters $\{ a_i \}$ and $\{ b_i \}$. In this paper, we present a simple inflated sample mean (ISM) type policy that is asymptotically optimal in the sense of its regret achieving the asymptotic lower bound of Burnetas and Katehakis (1996). Additionally, finite horizon regret bounds are given. Comment: arXiv admin note: text overlap with arXiv:1504.0582
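    The equivalence between maximizing the expected sum of outcomes and minimizing regret can be made explicit with the standard regret decomposition. The following is a sketch under assumed notation that does not appear in the abstract: $\mu_i$ is the mean of population $i$, $\mu^*$ is the largest mean, and $T^i_\pi(n)$ is the number of times policy $\pi$ has sampled population $i$ through time $n$.

```latex
\mu_i = \frac{a_i + b_i}{2}, \qquad \mu^* = \max_{1 \leq i \leq N} \mu_i,
\qquad
R_\pi(n) = n\,\mu^* - \sum_{i=1}^{N} \mu_i \, \mathbb{E}\!\left[ T^i_\pi(n) \right]
         = \sum_{i=1}^{N} \left( \mu^* - \mu_i \right) \mathbb{E}\!\left[ T^i_\pi(n) \right].
```

    The second equality uses $\sum_{i=1}^{N} T^i_\pi(n) = n$, so only samples taken from sub-optimal populations contribute to the regret.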

    Asymptotically Optimal Sequential Experimentation Under Generalized Ranking

    We consider the classical problem of a controller activating (or sampling) sequentially from a finite number of $N \geq 2$ populations, specified by unknown distributions. Over some time horizon, at each time $n = 1, 2, \ldots$, the controller wishes to select a population to sample, with the goal of sampling from a population that optimizes some "score" function of its distribution, e.g., maximizing the expected sum of outcomes or minimizing variability. We define a class of Uniformly Fast (UF) sampling policies and show, under mild regularity conditions, that there is an asymptotic lower bound for the expected total number of sub-optimal population activations. Then, we provide sufficient conditions under which a UCB policy is UF and asymptotically optimal, since it attains this lower bound. Explicit solutions are provided for a number of examples of interest, including general score functionals on unconstrained Pareto distributions (of potentially infinite mean), and uniform distributions of unknown support. Additional results on bandits of Normal distributions are also provided.

    Asymptotic Behavior of Minimal-Exploration Allocation Policies: Almost Sure, Arbitrarily Slow Growing Regret

    The purpose of this paper is to provide further insight into the structure of the sequential allocation ("stochastic multi-armed bandit", or MAB) problem by establishing probability-one finite horizon bounds and convergence rates for the sample (or "pseudo") regret associated with two simple classes of allocation policies $\pi$. For any slowly increasing function $g$, subject to mild regularity constraints, we construct two policies (the $g$-Forcing and the $g$-Inflated Sample Mean) that achieve a measure of regret of order $O(g(n))$ almost surely as $n \to \infty$, bounded from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function $g$ effectively controls the "exploration" of the classical "exploration/exploitation" tradeoff.

    Inventory Control Involving Unknown Demand of Discrete Nonperishable Items - Analysis of a Newsvendor-based Policy

    Inventory control with unknown demand distribution is considered, with emphasis placed on the case involving discrete nonperishable items. We focus on an adaptive policy which in every period uses, as much as possible, the optimal newsvendor ordering quantity for the empirical distribution learned up to that period. The policy is assessed using the regret criterion, which measures the price paid for ambiguity about the demand distribution over $T$ periods. When there are guarantees on the latter's separation from the critical newsvendor parameter $\beta = b/(h+b)$, a constant upper bound on regret can be found. Without any prior information on the demand distribution, we show that the regret does not grow faster than the rate $T^{1/2+\epsilon}$ for any $\epsilon > 0$. In view of a known lower bound, this is almost the best one could hope for. Simulation studies involving this policy along with other policies are also conducted.
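    To illustrate the "optimal newsvendor ordering quantity for the empirical distribution", here is a minimal Python sketch. It is not the paper's policy: it ignores leftover nonperishable stock and simply returns the smallest observed demand whose empirical CDF value reaches the critical ratio $\beta = b/(h+b)$. Reading `b` as a per-unit shortage cost and `h` as a per-unit holding cost is an assumption (the abstract does not define them), and the function name is hypothetical.

```python
def empirical_newsvendor_quantity(demands, b, h):
    """Return the beta-quantile of the empirical demand distribution.

    beta = b / (h + b) is the critical newsvendor ratio.
    Assumption: b is a per-unit shortage cost and h a per-unit holding
    cost, both positive. The result is the smallest observed demand q
    whose empirical CDF satisfies F_hat(q) >= beta.
    """
    if not demands:
        raise ValueError("need at least one demand observation")
    if b <= 0 or h <= 0:
        raise ValueError("costs b and h must be positive")

    ordered = sorted(demands)
    n = len(ordered)
    for k, q in enumerate(ordered, start=1):
        # k / n >= b / (h + b), rewritten without division so that
        # integer costs are compared exactly.
        if k * (h + b) >= b * n:
            return q
    return ordered[-1]  # not reached for positive costs; kept as a safeguard


if __name__ == "__main__":
    history = [2, 5, 3, 4]
    print(empirical_newsvendor_quantity(history, b=3, h=1))
```

    With the four observations above and $\beta = 3/4$, the third-smallest demand is the first whose empirical CDF reaches $\beta$, so the sketch orders that amount. An adaptive policy in the spirit of the abstract would recompute this quantity each period as new demand observations arrive.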

    Dynamic Pricing in a Dual Market Environment

    This paper is concerned with the determination of pricing strategies for a firm that in each period of a finite horizon receives replenishment quantities of a single product which it sells in two markets, e.g., a long-distance market and an on-site market. The key difference between the two markets is that the long-distance market provides for a one-period delay in demand fulfillment. In contrast, on-site orders must be filled immediately as the customer is at the physical on-site location. We model the demands in consecutive periods as independent random variables whose distributions depend on the item's price in accordance with two general stochastic demand functions: additive or multiplicative. The firm uses a single pool of inventory to fulfill demands from both markets. We investigate properties of the structure of the dynamic pricing strategy that maximizes the total expected discounted profit over the finite time horizon, under fixed or controlled replenishment conditions. Further, we provide conditions under which one market may be the preferred sales outlet over the other.

    Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem

    Consider the problem of sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $i = 1,\ldots,N$, and $k = 1, 2, \ldots$; where $X^i_k$ denotes the outcome from population $i$ the $k^{th}$ time it is sampled. It is assumed that for each fixed $i$, $\{ X^i_k \}_{k \geq 1}$ is a sequence of i.i.d. normal random variables, with unknown mean $\mu_i$ and unknown variance $\sigma_i^2$. The objective is to have a policy $\pi$ for deciding from which of the $N$ populations to sample at any time $n = 1, 2, \ldots$ so as to maximize the expected sum of outcomes of $n$ samples or, equivalently, to minimize the regret due to lack of information about the parameters $\mu_i$ and $\sigma_i^2$. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from Burnetas and Katehakis (1996). Additionally, finite horizon regret bounds are given. Comment: 15 pages 3 figure

    Optimal Data Driven Resource Allocation under Multi-Armed Bandit Observations

    This paper introduces the first asymptotically optimal strategy for a multi-armed bandit (MAB) model under side constraints. The side constraints model situations in which bandit activations are limited by the availability of certain resources that are replenished at a constant rate. The main result involves the derivation of an asymptotic lower bound for the regret of feasible uniformly fast policies and the construction of policies that achieve this lower bound, under pertinent conditions. Further, we provide the explicit form of such policies for three cases: Normal distributions with unknown means and known variances, Normal distributions with unknown means and unknown variances, and arbitrary discrete distributions with finite support. Comment: arXiv admin note: text overlap with arXiv:1509.0285

    Cash-Flow Based Dynamic Inventory Management

    Small-to-medium size enterprises (SMEs), including many startup firms, need to manage interrelated flows of cash and inventories of goods. In this paper, we model a firm that can finance its inventory (ordered or manufactured) with loans in order to meet random demand which in general may not be time stationary. The firm earns interest on its cash on hand and pays interest on its debt. The objective is to maximize the expected value of the firm's capital at the end of a finite planning horizon. Our study shows that the optimal ordering policy is characterized by a pair of threshold variables for each period as a function of the initial state of the period. Further, upper and lower bounds for the threshold values are developed using two simple-to-compute ordering policies. Based on these bounds, we provide an efficient algorithm to compute the two threshold values. Since the underlying state space is two-dimensional, which leads to high computational complexity of the optimization algorithm, we also derive upper bounds for the optimal value function by reducing the optimization problem to one dimension. Subsequently, it is shown that policies of similar structure are optimal when the loan and deposit interest rates are piecewise linear functions, when there is a maximal loan limit, and when unsatisfied demand is backordered. Finally, further managerial insights are provided with numerical studies.

    A Comparative Analysis of the Successive Lumping and the Lattice Path Counting Algorithms

    This article provides a comparison of the successive lumping (SL) methodology with the popular lattice path counting algorithm in obtaining rate matrices for queueing models satisfying the quasi-birth-and-death structure. The two methodologies are compared both in terms of applicability requirements and numerical complexity by analyzing their performance for the same classical queueing models. The main findings are: i) When both methods are applicable, SL-based algorithms outperform the lattice path counting algorithm (LPCA). ii) There are important classes of problems (e.g., models with (level) non-homogeneous rates or with finite state spaces) for which the SL methodology is applicable and for which the LPCA cannot be used. iii) Another main advantage of successive lumping algorithms over LPCAs is that the former include a method to compute the steady state distribution using this rate matrix.

    On the Solution to a Countable System of Equations Arising in Stochastic Processes

    In this paper we develop a method to compute the solution to a countable (finite or infinite) set of equations that occurs in many different fields, including Markov processes that model queueing systems, birth-and-death processes, and inventory systems. The method provides a fast and exact computation of the inverse of the matrix of the coefficients of the system. In contrast, alternative inversion techniques are much slower and work only for finite size matrices. Furthermore, we provide a procedure to construct the eigenvalues of the matrix under consideration.